import pandas as pd
import numpy as np
1 Goal
Today I chose a dataset from Kaggle with data on the air quality of Delhi. I want to try to fit a normal distribution to some of the data.
df = pd.read_csv('data/day25/delhi_air_quality.csv')
df.head(5)
 | Date | Month | Year | Holidays_Count | Days | PM2.5 | PM10 | NO2 | SO2 | CO | Ozone | AQI |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 2021 | 0 | 5 | 408.80 | 442.42 | 160.61 | 12.95 | 2.77 | 43.19 | 462 |
1 | 2 | 1 | 2021 | 0 | 6 | 404.04 | 561.95 | 52.85 | 5.18 | 2.60 | 16.43 | 482 |
2 | 3 | 1 | 2021 | 1 | 7 | 225.07 | 239.04 | 170.95 | 10.93 | 1.40 | 44.29 | 263 |
3 | 4 | 1 | 2021 | 0 | 1 | 89.55 | 132.08 | 153.98 | 10.42 | 1.01 | 49.19 | 207 |
4 | 5 | 1 | 2021 | 0 | 2 | 54.06 | 55.54 | 122.66 | 9.70 | 0.64 | 48.88 | 149 |
# Get an understanding of the data
df.describe()
 | Date | Month | Year | Holidays_Count | Days | PM2.5 | PM10 | NO2 | SO2 | CO | Ozone | AQI |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 |
mean | 15.729637 | 6.522930 | 2022.501027 | 0.189596 | 4.000684 | 90.774538 | 218.219261 | 37.184921 | 20.104921 | 1.025832 | 36.338871 | 202.210815 |
std | 8.803105 | 3.449884 | 1.118723 | 0.392116 | 2.001883 | 71.650579 | 129.297734 | 35.225327 | 16.543659 | 0.608305 | 18.951204 | 107.801076 |
min | 1.000000 | 1.000000 | 2021.000000 | 0.000000 | 1.000000 | 0.050000 | 9.690000 | 2.160000 | 1.210000 | 0.270000 | 2.700000 | 19.000000 |
25% | 8.000000 | 4.000000 | 2022.000000 | 0.000000 | 2.000000 | 41.280000 | 115.110000 | 17.280000 | 7.710000 | 0.610000 | 24.100000 | 108.000000 |
50% | 16.000000 | 7.000000 | 2023.000000 | 0.000000 | 4.000000 | 72.060000 | 199.800000 | 30.490000 | 15.430000 | 0.850000 | 32.470000 | 189.000000 |
75% | 23.000000 | 10.000000 | 2024.000000 | 0.000000 | 6.000000 | 118.500000 | 297.750000 | 45.010000 | 26.620000 | 1.240000 | 45.730000 | 284.000000 |
max | 31.000000 | 12.000000 | 2024.000000 | 1.000000 | 7.000000 | 1000.000000 | 1000.000000 | 433.980000 | 113.400000 | 4.700000 | 115.870000 | 500.000000 |
We see that four years of data are available (2021 to 2024), with one recording every day: the 1461 rows match 3 × 365 + 366 days, since 2024 is a leap year. It would now be interesting to plot the PM2.5 column.
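The row count can be cross-checked against the calendar. This is a small sketch that assumes the data runs from 1 January 2021 through 31 December 2024; the start date matches the first row shown above, while the end date is my assumption based on the maximum Year being 2024.

```python
from datetime import date

# Assumed date range: the start comes from the first row of the table,
# the end is an assumption based on the maximum Year being 2024.
first_day = date(2021, 1, 1)
last_day = date(2024, 12, 31)

# Number of calendar days in the range, counting both endpoints.
expected_rows = (last_day - first_day).days + 1
print(expected_rows)  # 1461
```

If the count matches the number of rows reported by describe(), daily coverage without gaps is at least plausible, though it does not rule out duplicated or missing dates.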
df.columns
Index(['Date', 'Month', 'Year', 'Holidays_Count', 'Days', 'PM2.5', 'PM10',
'NO2', 'SO2', 'CO', 'Ozone', 'AQI'],
dtype='object')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1461 entries, 0 to 1460
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 1461 non-null int64
1 Month 1461 non-null int64
2 Year 1461 non-null int64
3 Holidays_Count 1461 non-null int64
4 Days 1461 non-null int64
5 PM2.5 1461 non-null float64
6 PM10 1461 non-null float64
7 NO2 1461 non-null float64
8 SO2 1461 non-null float64
9 CO 1461 non-null float64
10 Ozone 1461 non-null float64
11 AQI 1461 non-null int64
dtypes: float64(6), int64(6)
memory usage: 137.1 KB
import altair as alt
alt.Chart(df).mark_point().encode(
    x='Month',
    y='PM2.5'
)
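The PM2.5 column can't be plotted as-is. A likely reason is that Altair treats a period in a field name as access to a nested field, so the column is renamed before trying again.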
df = df.rename(columns={'PM2.5': 'PM2_5'})
# Trying again with the new column name
alt.Chart(df).mark_point().encode(
    x='Month',
    y='PM2_5'
)
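Renaming one column by hand works here, but the same problem could appear in any column whose name contains a period or square bracket. As a rough sketch, a small helper (sanitize_column is a hypothetical name, not part of pandas or Altair) could clean every name in one pass. The column list is copied from the df.columns output above.

```python
import re

def sanitize_column(name):
    """Replace periods and square brackets, which chart field parsers may treat specially."""
    return re.sub(r"[.\[\]]", "_", name)

# Column names as reported by df.columns in this post.
columns = ['Date', 'Month', 'Year', 'Holidays_Count', 'Days', 'PM2.5', 'PM10',
           'NO2', 'SO2', 'CO', 'Ozone', 'AQI']

renamed = [sanitize_column(c) for c in columns]
print(renamed)
```

Since DataFrame.rename accepts a function for its columns argument, the helper could be applied directly with df.rename(columns=sanitize_column).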
That did the trick. We can clearly see that PM2.5 particle levels are generally lowest in July to September, with December and January being the worst. There is, however, an outlier in June with a PM2.5 of 1000. Maybe the measuring instrument couldn't read above that threshold; PM10 also tops out at exactly 1000 in the summary table, which would be consistent with such a cap.
2 Calculating the normal distribution of PM2.5 for 2024
import math
import matplotlib.pyplot as plt
df_2024 = df[df['Year'] == 2024]
def normal_pdf(x, mu=0, sigma=1):
    # Density at x of a normal distribution with mean mu and standard deviation sigma
    sqrt_two_pi = math.sqrt(2 * math.pi)
    return (math.exp(-(x-mu) ** 2 / 2 / sigma ** 2) / (sqrt_two_pi * sigma))
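A quick way to gain confidence in a hand-written density function is to compare it with an independent implementation. The sketch below repeats normal_pdf so that it runs on its own and checks it against NormalDist from Python's standard library. The parameter values are illustrative and not taken from the dataset.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0, sigma=1):
    # Same formula as in the post: exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))
    sqrt_two_pi = math.sqrt(2 * math.pi)
    return math.exp(-(x - mu) ** 2 / 2 / sigma ** 2) / (sqrt_two_pi * sigma)

# Illustrative (x, mu, sigma) triples; not values from the Delhi data.
cases = [(0, 0, 1), (1.5, 0, 1), (80, 90, 70)]
matches = [
    math.isclose(normal_pdf(x, mu, sigma), NormalDist(mu, sigma).pdf(x))
    for x, mu, sigma in cases
]
print(matches)

# The standard normal density peaks at x = 0 with height 1 / sqrt(2 * pi).
peak = normal_pdf(0)
print(peak)
```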
# Storing the mean value of PM2.5 in 2024
mu = df_2024['PM2_5'].mean()
# Storing the standard deviation of PM2.5
sigma = df_2024['PM2_5'].std()
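# Remove the outlier at 1000 by keeping only rows below the 99th percentile of PM2_5
# Note: mu and sigma were computed before this filter, so they still include the outlier
df_2024 = df_2024[df_2024['PM2_5'] < df_2024['PM2_5'].quantile(0.99)]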
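Because the mean and standard deviation are stored before the outlier is filtered out, the fitted curve still reflects the value at 1000. The sketch below uses a handful of made-up readings (purely illustrative, not rows from the dataset) to show how much a single saturated value can shift both statistics. statistics.stdev is the sample standard deviation, the same convention pandas uses by default in .std().

```python
import statistics

# Made-up readings for illustration only; 1000.0 mimics a saturated sensor value.
readings = [40.0, 55.0, 60.0, 72.0, 85.0, 90.0, 110.0, 1000.0]

def summarize(values):
    """Return (mean, sample standard deviation) of the values."""
    return statistics.mean(values), statistics.stdev(values)

mu_all, sigma_all = summarize(readings)

# Drop the saturated value, then recompute.
trimmed = [v for v in readings if v < 1000.0]
mu_trim, sigma_trim = summarize(trimmed)

print(f"with outlier:    mean={mu_all:.1f}, std={sigma_all:.1f}")
print(f"without outlier: mean={mu_trim:.1f}, std={sigma_trim:.1f}")
```

In the notebook, the equivalent change would be to apply the quantile filter first and compute mu and sigma afterwards.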
# Creating an array of evenly spaced x values (step 1) between the minimum and maximum PM2.5.
# The PM2_5 column can't be used as-is, because the observed values leave gaps between min and max
xs = np.arange(min(df_2024['PM2_5']), max(df_2024['PM2_5']))
# Storing y values of the function
y = []
for x in xs:
    y.append(normal_pdf(x, mu=mu, sigma=sigma))
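The loop evaluates the density one value at a time. With NumPy the same formula can be applied to the whole array at once. This is a sketch of that alternative; normal_pdf_vec is a hypothetical name, and the range and parameters below are placeholders rather than the values computed from the data.

```python
import numpy as np

def normal_pdf_vec(x, mu=0.0, sigma=1.0):
    """Normal density evaluated element-wise on an array of x values."""
    x = np.asarray(x, dtype=float)
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Placeholder grid and parameters; in the notebook these would be xs, mu and sigma.
xs = np.arange(0.0, 200.0)
y = normal_pdf_vec(xs, mu=90.0, sigma=70.0)

print(y.shape)
print(xs[np.argmax(y)])  # the density is highest at the mean
```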
# Plotting the distribution
plt.plot(xs, y)
plt.title("Normal distribution of PM2.5 in Delhi 2024")
plt.show()
3 Reflections
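We thus have a probability density for PM2.5, under the assumption that the values are normally distributed. The height of the curve shows how likely values near a given level are relative to each other; the probability of PM2.5 falling within a range is the area under the curve over that range.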
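To turn the density into actual probabilities, the cumulative distribution function can be used: the probability of a range is the difference between the CDF at its two ends. The sketch below uses NormalDist from the standard library with placeholder parameters (mean 90, standard deviation 70), which are assumptions for illustration and not the values fitted in this post.

```python
from statistics import NormalDist

# Placeholder parameters; substitute the mu and sigma computed from the data.
dist = NormalDist(mu=90.0, sigma=70.0)

# Probability that PM2.5 falls between 60 and 120 under this model.
p_range = dist.cdf(120.0) - dist.cdf(60.0)

# Probability of being within one standard deviation of the mean (about 68% for any normal).
p_within_one_sigma = dist.cdf(160.0) - dist.cdf(20.0)

# Probability the model assigns to negative values, which a concentration cannot take.
p_negative = dist.cdf(0.0)

print(p_range, p_within_one_sigma, p_negative)
```

With parameters like these, the model places a noticeable share of probability below zero, which hints that a normal distribution is only a rough description of PM2.5.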
Besides calculating the normal distribution, it could be interesting to use linear regression to approximate the PM2.5 level on any given day.
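As a starting point for that idea, here is a minimal sketch of simple least-squares regression written in plain Python. The fit_line helper and the data are hypothetical: the points are synthetic and lie exactly on a line, so the result is easy to verify. Given the seasonal pattern seen in the scatter plot, a straight line over the whole year would probably be too simple for the real data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic data generated from y = 2x + 1; not taken from the dataset.
days = [1, 2, 3, 4, 5]
values = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(days, values)
prediction = slope * 6 + intercept  # predicted value for day 6
print(slope, intercept, prediction)
```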